Project 3: Aim 1

Prediction of Gene Function using Phylogenetic Trees
IMAGE Retreat 2023

George G. Vega Yon
Paul Thomas
Paul Marjoram
Huaiyu Mi
Christopher Williams

2023-06-05

You can download the slides from https://ggv.cl/image-retreat2023

Recap

Starting point

“Bayesian Parameter Estimation for Automatic Annotation of Gene Functions Using Observational Data and Phylogenetic Trees” – (G. G. Vega Yon et al. 2021)

Using phylogenetic trees to predict gene function.

  • Simple model using Felsenstein’s pruning algorithm.
  • Assumes that genes’ functions evolve independently.
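A minimal sketch of the pruning idea for a single binary function (0 = absent, 1 = present), assuming fixed gain/loss probabilities per branch. The tree, the rates, and the names `transition` and `pruning_likelihood` are hypothetical and only illustrate the recursion; this is not the aphylo implementation.

```python
def transition(gain, loss):
    # Rows: parent state; columns: offspring state (0 = absent, 1 = present).
    return [[1.0 - gain, gain], [loss, 1.0 - loss]]


def pruning_likelihood(children, leaf_state, root, gain, loss, root_prob):
    """Likelihood of the observed leaf annotations for one binary function."""
    P = transition(gain, loss)

    def partial(node):
        # partial(node)[s] = P(observed leaves below node | node is in state s)
        if node not in children:  # leaf: all mass on the observed state
            return [1.0 if leaf_state[node] == s else 0.0 for s in (0, 1)]
        out = [1.0, 1.0]
        for child in children[node]:
            below = partial(child)
            for s in (0, 1):
                # Sum over the child's state, then multiply across children.
                out[s] *= P[s][0] * below[0] + P[s][1] * below[1]
        return out

    at_root = partial(root)
    return (1.0 - root_prob) * at_root[0] + root_prob * at_root[1]


# Hypothetical two-leaf tree: gene "a" has the function, gene "b" does not.
children = {"root": ["a", "b"]}
leaf_state = {"a": 1, "b": 0}
lik = pruning_likelihood(children, leaf_state, "root",
                         gain=0.1, loss=0.2, root_prob=0.5)
print(lik)
```

Because each function is treated independently, the likelihood for several functions would simply be the product of calls like this one, one per function.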

Models in the project

The key difference between the models is how they model the transition from parent to offspring: \(\mathbb{P}\left(x_n\to x_o\right)\)

aphylo

  • Fixed gain/loss rates.

  • Full independence between genes/functions.

aphylo2

  • Event-specific gain/loss rates.

  • Full independence between genes/functions.

GEESE

  • Event-specific gain/loss rates.

  • Jointly distributed model.
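As a rough illustration of the first two models, the sketch below computes the parent-to-offspring transition probability for a vector of binary functions under full independence, first with fixed rates and then with event-specific rates. All rate values and the function name `transition_prob` are hypothetical. The joint GEESE model is not sketched here, since it does not factor function by function.

```python
def transition_prob(parent, offspring, gain, loss):
    """P(parent -> offspring) for binary function vectors, assuming independence."""
    prob = 1.0
    for x_p, x_o in zip(parent, offspring):
        if x_p == 0:
            prob *= gain if x_o == 1 else 1.0 - gain
        else:
            prob *= loss if x_o == 0 else 1.0 - loss
    return prob


# Hypothetical rates, for illustration only.
fixed = {"gain": 0.05, "loss": 0.10}  # one pair of rates for every branch
by_event = {                          # rates depend on the type of event
    "duplication": {"gain": 0.20, "loss": 0.30},
    "speciation": {"gain": 0.02, "loss": 0.05},
}

parent, offspring = (1, 0), (1, 1)  # keeps function 1, gains function 2
p_fixed = transition_prob(parent, offspring, **fixed)
p_dupl = transition_prob(parent, offspring, **by_event["duplication"])
print(p_fixed, p_dupl)
```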

Evolution of gene function (multiple functions): tapping into evolutionary theory

  • Duplication events are a fundamental part of functional evolution.
  • Furthermore, knowing what happened to gene A (e.g., neofunctionalization) is highly informative about the functional state of gene B.

A key part of molecular innovation, gene duplication provides an opportunity for new functions to emerge (wikimedia)

GEESE: GEne functional Evolution using SufficiEncy

Model fully implemented in C++ and R. It already shows great promise.

Challenges

Rough edges

We have only discussed Sub Aim 2

  • Sub Aim 1 is supposed to deal with the Hierarchical Bayesian Framework.

  • Some work is done, but we need a leader for this.

Data challenges

  • Most trees aren’t fully reconciled: GEESE’s complexity grows exponentially with the number of offspring of a node.

  • Negative assertions (gene NOT associated with a function) are rare… but Christopher has made good progress using taxon constraints.
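To give a sense of the exponential growth noted in the first point above, here is a small sketch. It assumes that a joint model must consider every binary configuration of the offspring-by-function array at a node; the function name `joint_offspring_states` is hypothetical, and the count is a rough stand-in for GEESE's actual cost.

```python
def joint_offspring_states(n_offspring, n_functions):
    # Each (offspring, function) cell is 0 or 1, so there are
    # 2 ** (n_offspring * n_functions) joint configurations.
    return 2 ** (n_offspring * n_functions)


# With 2 functions: 2 offspring -> 16 states; 10 offspring -> 1,048,576 states.
for k in (2, 3, 5, 10):
    print(k, joint_offspring_states(k, n_functions=2))
```

Under this assumption, a node with many offspring, as found in trees that are not fully reconciled, dominates the computation.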

Discussion

Goals

Manuscripts

  • The GEESE paper is about 80% done. We need to finish it this year.

  • Low-hanging fruit: aphylo2 can be submitted to PLOS Comp. Bio. using aphylo (a software prototype is up and running).

Collaboration (ideas)

  • Augment -omics data with GEESE (à la carte): Use GEESE on a gene list to make predictions, then use those predictions as additional -omics data.

Bonus: Mechanistic ML (prelim res.)

Not mentioned in the original grant, but we could add it to this (or any) project.

References

Engelhardt, B. E., M. I. Jordan, J. R. Srouji, and S. E. Brenner. 2011. “Genome-Scale Phylogenetic Function Annotation of Large and Diverse Protein Families.” Genome Research 21 (11): 1969–80. https://doi.org/10.1101/gr.104687.109.
Vega Yon, G. G., D. C. Thomas, J. Morrison, H. Mi, P. D. Thomas, and P. Marjoram. 2021. “Bayesian Parameter Estimation for Automatic Annotation of Gene Functions Using Observational Data and Phylogenetic Trees.” PLoS Comput Biol 17: e1007948. https://doi.org/10.1371/journal.pcbi.1007948.
Vega Yon, George G., Mary Jo Pugh, and Thomas W. Valente. 2022. “Discrete Exponential-Family Models for Multivariate Binary Outcomes.” arXiv. https://arxiv.org/abs/2211.00627.